1 Converting thread-level parallelism to instruction-level parallelism via simultaneous
multithreading 

Jack L. Lo, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, Dean M. Tullsen, S. J. Eggers 
August 1997 ACM Transactions on Computer Systems (TOCS), volume 15 issue 3

Additional Information: full citation, abstract, references, citings, index terms, review

Full text available: pdf(526.39 KB)



To achieve high performance, contemporary computer systems rely on two forms of 
parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue 
super-scalar processors exploit ILP by executing multiple instructions from a single program 
in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel 
on different processors. Unfortunately, both parallel processing styles statically partition 
processor resources, thus preventing t ... 

Keywords: cache interference, instruction-level parallelism, multiprocessors, 
multithreading, simultaneous multithreading, thread-level parallelism 



2 A new guaranteed heuristic for the software pipelining problem
Pierre-Yves Calland, Alain Darte, Yves Robert 

January 1996 Proceedings of the 10th international conference on Supercomputing 

Full text available: pdf(892.93 KB) Additional Information: full citation, references, index terms



Keywords: circuit retiming, cyclic scheduling, guaranteed heuristic, list scheduling, 
software pipelining 



3 AlphaSort: a RISC machine sort

Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, Dave Lomet 

May 1994 ACM SIGMOD Record, Proceedings of the 1994 ACM SIGMOD international
conference on Management of data, volume 23 issue 2

Full text available: pdf(1.17 MB) Additional Information: full citation, abstract, references, citings, index terms

A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks 
can handle commercial batch workloads. Using Alpha AXP processors, commodity memory, 
and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven
seconds. This beats the best published record on a 32-cpu 32-disk Hypercube by 8:1. On 
another benchmark, AlphaSort sorted more than a gigabyte in a minute. AlphaSort is a 
cache-sensitive memory-intensive sort algorithm. It ... 

4 Special system-oriented section: the best of SIGMOD '94: AlphaSort: a cache-sensitive
parallel external sort 

Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, Dave Lomet 

October 1995 The VLDB Journal — The International Journal on Very Large Data Bases, 

Volume 4 Issue 4 

Full text available: pdf(1.37 MB) Additional Information: full citation, abstract, references, citings

A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks 
can handle commercial batch workloads. Using commodity processors, memory, and arrays 
of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds. This 
beats the best published record on a 32-CPU 32-disk Hypercube by 8:1. On another 
benchmark, AlphaSort sorted more than a gigabyte in one minute. AlphaSort is a
cache-sensitive, memory-intensive sort algorithm. We argue that modern arch ...

Keywords: Alpha, Dec 7000, cache, disk, memory, parallel, sort, striping 



5 Adaptive two-level thread management for fast MPI execution on shared memory
machines 

Kai Shen, Hong Tang, Tao Yang 

January 1999 Proceedings of the 1999 ACM/IEEE conference on Supercomputing 
(CDROM) 

Full text available: pdf(152.63 KB) Additional Information: full citation, references, citings, index terms



6 Implementation of a parallel unstructured Euler solver on shared and distributed
memory architectures 

D. J. Mavriplis, R. Das, R. E. Vermeland, J. Saltz 

December 1992 Proceedings of the 1992 ACM/IEEE conference on Supercomputing 

Full text available: pdf(1.55 MB) Additional Information: full citation, references, citings, index terms



7 GI-cube: an architecture for volumetric global illumination and rendering
Frank Dachille, Arie Kaufman 

August 2000 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on 
Graphics hardware 

Full text available: pdf(650.91 KB) Additional Information: full citation, abstract, references, citings, index terms

The power and utility of volume rendering is increased by global illumination. We present a 
hardware architecture, GI-Cube, designed to accelerate volume rendering, empower
volumetric global illumination, and enable a host of ray-based volumetric processing. The 
algorithm reorders ray processing based on a partitioning of the volume. A cache enables 
efficient processing of coherent rays within a hardware pipeline. We study the flexibility and 
performance of this new architecture using both ... 

Keywords: hardware accelerator, volume processing, volume rendering, volumetric global 
illumination, volumetric ray tracing 

8 A compilation-based software estimation scheme for hardware/software co-simulation
Marcello Lajolo, Mihai Lazarescu, Alberto Sangiovanni-Vincentelli 

March 1999 Proceedings of the seventh international workshop on Hardware/software
codesign

Full text available: pdf(437.23 KB) Additional Information: full citation, references, citings, index terms



Keywords: compilation, delay modeling, software estimation 



9 Application restructuring and performance portability on shared virtual memory and
hardware-coherent multiprocessors 

Dongming Jiang, Hongzhang Shan, Jaswinder Pal Singh 

June 1997 ACM SIGPLAN Notices, Proceedings of the sixth ACM SIGPLAN symposium
on Principles and practice of parallel programming, volume 32 issue 7

Full text available: pdf(1.59 MB) Additional Information: full citation, abstract, references, citings, index terms

The performance portability of parallel programs across a wide range of emerging coherent 
shared address space systems is not well understood. Programs that run well on efficient, 
hardware cache-coherent systems often do not perform well on less optimal or more 
commodity-based communication architectures. This paper studies this issue of performance 
portability, with the commodity communication architecture of interest being page-grained 
shared virtual memory. We begin with applications that per ... 

10 Compile/run-time support for threaded MPI execution on multiprogrammed shared
memory machines 

Hong Tang, Kai Shen, Tao Yang 

May 1999 ACM SIGPLAN Notices, Proceedings of the seventh ACM SIGPLAN
symposium on Principles and practice of parallel programming, volume 34 issue 8

Full text available: pdf(1.54 MB) Additional Information: full citation, abstract, references, citings, index terms

MPI is a message-passing standard widely used for developing high-performance parallel 
applications. Because of the restriction in the MPI computation model, conventional 
implementations on shared memory machines map each MPI node to an OS process, which 
suffers serious performance degradation in the presence of multiprogramming, especially 
when a space/time sharing policy is employed in OS job scheduling. In this paper, we study 
compile-time and run-time support for MPI by using threads and dem ... 

11 The Totem multiple-ring ordering and topology maintenance protocol
D. A. Agarwal, L. E. Moser, P. M. Melliar-Smith, R. K. Budhia 

May 1998 ACM Transactions on Computer Systems (TOCS), volume 16 issue 2 

Full text available: pdf(367.16 KB) Additional Information: full citation, abstract, references, citings, index terms, review

The Totem multiple-ring protocol provides reliable totally ordered delivery of messages 
across multiple local-area networks interconnected by gateways. This consistent message 
order is maintained in the presence of network partitioning and remerging, and of processor 
failure and recovery. The protocol provides accurate topology change information as part of 
the global total order of messages. It addresses the issue of scalability and achieves a 
latency that increases logarithmically with ... 

Keywords: Lamport timestamp, network partitioning, reliable delivery, topology 
maintenance, total ordering, virtual synchrony 

12 A flexible operation execution model for shared distributed objects
Saniya Ben Hassen, Irina Athanasiu, Henri E. Bal 

October 1996 ACM SIGPLAN Notices, Proceedings of the 11th ACM SIGPLAN
conference on Object-oriented programming, systems, languages, and
applications, Volume 31 Issue 10

Full text available: pdf(2.30 MB) Additional Information: full citation, abstract, references, citings, index terms

Many parallel and distributed programming models are based on some form of shared 
objects, which may be represented in various ways (e.g., single-copy, replicated, and 
partitioned objects). Also, many different operation execution strategies have been designed 
for each representation. In programming systems that use multiple representations 
integrated in a single object model, one way to provide multiple execution strategies is to 
implement each strategy independently from the others. How ... 

13 Session 4: communications libraries: SLICC: a low latency interface for collective 
communications 

Allan D. Knies, William J. Harrod, F. Ray Barriuso, George B. Adams 

November 1994 Proceedings of the 1994 ACM/IEEE conference on Supercomputing 

Full text available: pdf(658.77 KB) Additional Information: full citation, abstract, references

Several recent parallel computers have implemented logically shared, physically distributed 
memory systems which allow processors to directly access memory in other processors 
without interrupting the referenced PE. Because this kind of architecture provides greater 
flexibility for interprocessor communications than private address space computers, different 
software models can be developed to take advantage of these machines. In this paper, we 
describe a low-level collective communications inte ... 

14 Generalized FLP impossibility result for t-resilient asynchronous computations
Elizabeth Borowsky, Eli Gafni 

June 1993 Proceedings of the twenty-fifth annual ACM symposium on Theory of 
computing 

Full text available: pdf(980.66 KB) Additional Information: full citation, references, citings, index terms



15 High performance data mining (tutorial PM-3)
Vipin Kumar, Mohammed Zaki 

August 2000 Tutorial notes of the sixth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(8.06 MB) Additional Information: full citation, references, index terms



16 High performance synchronization algorithms for multiprogrammed multiprocessors
Robert W. Wisniewski, Leonidas I. Kontothanassis, Michael L. Scott 

August 1995 ACM SIGPLAN Notices, Proceedings of the fifth ACM SIGPLAN symposium
on Principles and practice of parallel programming, volume 30 issue 8

Full text available: pdf(915.40 KB) Additional Information: full citation, abstract, references, citings, index terms

Scalable busy-wait synchronization algorithms are essential for achieving good parallel 
program performance on large scale multiprocessors. Such algorithms include mutual 
exclusion locks, reader-writer locks, and barrier synchronization. Unfortunately, scalable 
synchronization algorithms are particularly sensitive to the effects of multiprogramming: 
their performance degrades sharply when processors are shared among different
applications, or even among processes of the same application. In ... 

17 Communication and computation performance of the CM-5

T. T. Kwan, B. K. Totty, D. A. Reed 

December 1993 Proceedings of the 1993 ACM/IEEE conference on Supercomputing 

Full text available: pdf(810.49 KB) Additional Information: full citation, references, citings, index terms



18 An execution model for distributed object-oriented computation
Edward H. Bensley, Thomas J. Brando, Myra Jean Prelle 

January 1988 ACM SIGPLAN Notices, Conference proceedings on Object-oriented
programming systems, languages and applications, volume 23 issue 11

Full text available: pdf(786.18 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper describes an execution model being developed for distributed object-oriented computation in a
message-passing multiple-instruction/multiple-data-stream (MIMD) environment. The 
objective is to execute an object-oriented program as concurrently as possible. Some 
opportunities for concurrency can be identified explicitly by the programmer. Others can be 
identified at compile time. There are some opportunities for concurrency, however, that can 
only be discovered at runtime because they are data ... 

19 Totem: a fault-tolerant multicast group communication system
L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, C. A. Lingley-Papadopoulos 

April 1996 Communications of the ACM, volume 39 issue 4 

Full text available: pdf(342.07 KB) Additional Information: full citation, references, citings, index terms, review



20 Model refinement for hardware-software codesign

Jie Gong, Daniel D. Gajski, Smita Bakshi 

January 1997 ACM Transactions on Design Automation of Electronic Systems 
(TODAES), Volume 2 Issue 1 

Full text available: pdf(436.53 KB) Additional Information: full citation, abstract, references, index terms, review

Hardware-software codesign, which implements a given specification with a set of system 
components such as ASICs and processors, includes several key tasks such as system 
component allocation, functional partitioning, quality metrics estimation, and model 
refinement. In this work, we focus on the model refinement task which transforms a 
specification from an original functional model to a refined implementation model. First, we 
categorize several commonly used implementation models and desc ... 

Keywords: functional model, implementation model, model refinement, software-hardware
codesign 

1 Converting thread-level parallelism to instruction-level parallelism via simultaneous
multithreading 

Jack L. Lo, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, Dean M. Tullsen, S. J. Eggers
August 1997 ACM Transactions on Computer Systems (TOCS), volume 15 issue 3 

Additional Information: full citation, abstract, references, citings, index terms, review

Full text available: pdf(526.39 KB)



To achieve high performance, contemporary computer systems rely on two forms of 
parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue 
super-scalar processors exploit ILP by executing multiple instructions from a single program 
in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel 
on different processors. Unfortunately, both parallel processing styles statically partition 
processor resources, thus preventing t ... 

Keywords: cache interference, instruction-level parallelism, multiprocessors, 
multithreading, simultaneous multithreading, thread-level parallelism 

2 AlphaSort: a RISC machine sort

Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, Dave Lomet 

May 1994 ACM SIGMOD Record, Proceedings of the 1994 ACM SIGMOD international
conference on Management of data, volume 23 issue 2

Full text available: pdf(1.17 MB) Additional Information: full citation, abstract, references, citings, index terms

A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks 
can handle commercial batch workloads. Using Alpha AXP processors, commodity memory, 
and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven 
seconds. This beats the best published record on a 32-cpu 32-disk Hypercube by 8:1. On 
another benchmark, AlphaSort sorted more than a gigabyte in a minute. AlphaSort is a
cache-sensitive memory-intensive sort algorithm. It ... 

3 GI-cube: an architecture for volumetric global illumination and rendering
Frank Dachille, Arie Kaufman 

August 2000 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on 
Graphics hardware 

Full text available. Additional Information: full citation, abstract, references, citings, index terms

The power and utility of volume rendering is increased by global illumination. We present a
hardware architecture, GI-Cube, designed to accelerate volume rendering, empower
volumetric global illumination, and enable a host of ray-based volumetric processing. The 
algorithm reorders ray processing based on a partitioning of the volume. A cache enables 
efficient processing of coherent rays within a hardware pipeline. We study the flexibility and 
performance of this new architecture using both ... 

Keywords: hardware accelerator, volume processing, volume rendering, volumetric global 
illumination, volumetric ray tracing 



4 Special system-oriented section: the best of SIGMOD '94: AlphaSort: a cache-sensitive
parallel external sort 

Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, Dave Lomet 

October 1995 The VLDB Journal — The International Journal on Very Large Data Bases, 

Volume 4 Issue 4 

Full text available: pdf(1.37 MB) Additional Information: full citation, abstract, references, citings

A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks 
can handle commercial batch workloads. Using commodity processors, memory, and arrays 
of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds. This 
beats the best published record on a 32-CPU 32-disk Hypercube by 8:1. On another 
benchmark, AlphaSort sorted more than a gigabyte in one minute. AlphaSort is a
cache-sensitive, memory-intensive sort algorithm. We argue that modern arch ...

Keywords: Alpha, Dec 7000, cache, disk, memory, parallel, sort, striping 



5 Adaptive two-level thread management for fast MPI execution on shared memory 
machines 

Kai Shen, Hong Tang, Tao Yang 

January 1999 Proceedings of the 1999 ACM/IEEE conference on Supercomputing 
(CDROM) 

Full text available: pdf(152.63 KB) Additional Information: full citation, references, citings, index terms

6 Compile/run-time support for threaded MPI execution on multiprogrammed shared
memory machines

Hong Tang, Kai Shen, Tao Yang 

May 1999 ACM SIGPLAN Notices, Proceedings of the seventh ACM SIGPLAN
symposium on Principles and practice of parallel programming, volume 34 issue 8

Full text available: pdf(1.54 MB) Additional Information: full citation, abstract, references, citings, index terms

MPI is a message-passing standard widely used for developing high-performance parallel 
applications. Because of the restriction in the MPI computation model, conventional 
implementations on shared memory machines map each MPI node to an OS process, which 
suffers serious performance degradation in the presence of multiprogramming, especially 
when a space/time sharing policy is employed in OS job scheduling. In this paper, we study 
compile-time and run-time support for MPI by using threads and dem ... 

7 Generalized FLP impossibility result for t-resilient asynchronous computations
Elizabeth Borowsky, Eli Gafni 

8 Communication and computation performance of the CM-5
T. T. Kwan, B. K. Totty, D. A. Reed 

December 1993 Proceedings of the 1993 ACM/IEEE conference on Supercomputing 

9 High performance synchronization algorithms for multiprogrammed multiprocessors
Robert W. Wisniewski, Leonidas I. Kontothanassis, Michael L. Scott 

August 1995 ACM SIGPLAN Notices, Proceedings of the fifth ACM SIGPLAN symposium
on Principles and practice of parallel programming, volume 30 issue 8

Full text available: pdf(915.40 KB) Additional Information: full citation, abstract, references, citings, index terms

Scalable busy-wait synchronization algorithms are essential for achieving good parallel 
program performance on large scale multiprocessors. Such algorithms include mutual 
exclusion locks, reader-writer locks, and barrier synchronization. Unfortunately, scalable 
synchronization algorithms are particularly sensitive to the effects of multiprogramming: 
their performance degrades sharply when processors are shared among different 
applications, or even among processes of the same application. In ... 

10 A flexible operation execution model for shared distributed objects
Saniya Ben Hassen, Irina Athanasiu, Henri E. Bal 

October 1996 ACM SIGPLAN Notices, Proceedings of the 11th ACM SIGPLAN
conference on Object-oriented programming, systems, languages, and
applications, Volume 31 Issue 10

Full text available: pdf(2.30 MB) Additional Information: full citation, abstract, references, citings, index terms

Many parallel and distributed programming models are based on some form of shared 
objects, which may be represented in various ways (e.g., single-copy, replicated, and 
partitioned objects). Also, many different operation execution strategies have been designed 
for each representation. In programming systems that use multiple representations 
integrated in a single object model, one way to provide multiple execution strategies is to 
implement each strategy independently from the others. How ... 

11 High performance data mining (tutorial PM-3)
Vipin Kumar, Mohammed Zaki 

August 2000 Tutorial notes of the sixth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(8.06 MB) Additional Information: full citation, references, index terms



12 Model refinement for hardware-software codesign

Jie Gong, Daniel D. Gajski, Smita Bakshi

January 1997 ACM Transactions on Design Automation of Electronic Systems 
(TODAES), Volume 2 Issue 1 

Full text available: pdf(436.53 KB) Additional Information: full citation, abstract, references, index terms

Hardware-software codesign, which implements a given specification with a set of system
components such as ASICs and processors, includes several key tasks such as system 
component allocation, functional partitioning, quality metrics estimation, and model 
refinement. In this work, we focus on the model refinement task which transforms a 
specification from an original functional model to a refined implementation model. First, we 
categorize several commonly used implementation models and desc ... 

Keywords: functional model, implementation model, model refinement, software-hardware
codesign 

This paper describes an execution model being developed for distributed object-oriented computation in a
message-passing multiple-instruction/multiple-data-stream (MIMD) environment. The 
objective is to execute an object-oriented program as concurrently as possible. Some 
opportunities for concurrency can be identified explicitly by the programmer. Others can be 
identified at compile time. There are some opportunities for concurrency, however, that can 
only be discovered at runtime because they are data ... 

14 Functional divisions in the Piglet multiprocessor operating system

Steve Muir, Jonathan Smith 

September 1998 Proceedings of the 8th ACM SIGOPS European workshop on Support 
for composing distributed applications 